Every research project aims to answer a research question (or multiple questions).
Do ECU students who exercise regularly have a higher GPA?
Each research question aims to examine a population.
The population for this research question is ECU students.
It is usually impossible to study the entire population related to a research question.
A sample \(n\) is a subset of the population \(N\).
The Goal: Select a representative sample to generalize to the broader population.
What is representative?
Data quality matters more than data quantity
Many anthropological studies (or similar) are convenience-based.
Every member of a population has an equal chance of being selected.
To Generalize:
Similar to a simple random sample, but elements are selected from an ordered list at regular intervals.
# 1. Create a population (e.g., a vector of 1 to 1000)
population <- 1:1000
# 2. Define the desired sample size
sample_size <- 100
# 3. Calculate the sampling interval (k)
N <- length(population) # Population size
k <- N / sample_size
# If k is not an integer, you might use ceiling(N/n) and adjust the logic
# 4. Choose a random starting point (r) between 1 and k
set.seed(123) # Optional: for reproducible results
start_point <- sample(1:k, 1)
# 5. Select every k-th element starting from the random start point
systematic_sample_indices <- seq(from = start_point, to = N, by = k)
systematic_sample <- population[systematic_sample_indices]
# 6. View the first few elements and the dimension of the sample
head(systematic_sample)
[1]  3 13 23 33 43 53
length(systematic_sample)
[1] 100
set.seed(123)
population <- data.frame(
Supermarket = paste("Supermarket", 1:1000, sep = "_"),
CustomerSatisfaction = rnorm(1000, mean = 75, sd = 10)
)
selected_supermarkets <- sample(population$Supermarket, size = 10, replace = FALSE)
sampled_data <- population[population$Supermarket %in% selected_supermarkets, ]
head(sampled_data)
            Supermarket CustomerSatisfaction
203 Supermarket_203 72.34855
225 Supermarket_225 71.36343
255 Supermarket_255 90.98509
354 Supermarket_354 76.16637
457 Supermarket_457 86.10277
554 Supermarket_554 77.49825
set.seed(123)
region <- data.frame(
Neighborhood = paste("Neighborhood", 1:500, sep = "_"),
AverageIncome = rnorm(500, mean = 50000, sd = 10000)
)
households <- data.frame(
Neighborhood = rep(sample(region$Neighborhood, size = 500, replace = TRUE), each = 20),
HouseholdID = rep(1:20, times = 500),
EmploymentStatus = sample(c("Employed", "Unemployed"), size = 10000, replace = TRUE)
)
selected_neighborhoods <- sample(region$Neighborhood, size = 5, replace = FALSE)
sampled_households <- households[households$Neighborhood %in% selected_neighborhoods, ]
head(sampled_households)
     Neighborhood HouseholdID EmploymentStatus
1981 Neighborhood_302 1 Unemployed
1982 Neighborhood_302 2 Employed
1983 Neighborhood_302 3 Employed
1984 Neighborhood_302 4 Employed
1985 Neighborhood_302 5 Unemployed
1986 Neighborhood_302 6 Unemployed
set.seed(123)
states <- data.frame(
State = paste("State", 1:50, sep = "_"),
Population = sample(1000000:5000000, 50, replace = TRUE)
)
counties <- data.frame(
State = rep(sample(states$State, size = 50, replace = TRUE), each = 20),
County = rep(paste("County", 1:20, sep = "_"), times = 50),
VaccinationRate = rnorm(1000, mean = 70, sd = 5)
)
selected_states <- sample(states$State, size = 3, replace = FALSE)
selected_counties <- sample(counties$County[counties$State %in% selected_states], size = 5, replace = FALSE)
# Note: county names repeat across states, so filtering on County alone also matches same-named counties in non-selected states
sampled_vaccination_centers <- counties[counties$County %in% selected_counties, ]
head(sampled_vaccination_centers)
     State  County VaccinationRate
8 State_32 County_8 70.37428
11 State_32 County_11 66.86024
13 State_32 County_13 70.81309
15 State_32 County_15 67.68222
19 State_32 County_19 70.91839
28 State_46 County_8 68.84869
How do we infer future events or population characteristics?
In a random process there is more than one possible outcome.
Sample space: the set of all possible outcomes of a random process.
An event is a subset of the sample space.
Examples with a 6-sided die:
Let \(A\) represent the event that a single die roll results in an even number: \(A = \{2, 4, 6\}\).
Let \(B\) represent the event that a single die roll results in an odd number: \(B = \{1, 3, 5\}\).
Let \(C\) represent the event that a single die roll results in a prime number: \(C = \{2, 3, 5\}\).
Complement: the set of all outcomes in the sample space that are not in the event itself.
Example:
Let \(C\) represent the event that a single die roll results in a prime number: \(C = \{2, 3, 5\}\). Then \(C^C = \{1, 4, 6\}\).
With \(A = \{2, 4, 6\}\) (even), \(B = \{1, 3, 5\}\) (odd), and \(C = \{2, 3, 5\}\) (prime):
Events \(A\) and \(B\) are mutually exclusive because an outcome cannot be both even and odd.
Events \(A\) and \(C\) are not mutually exclusive because the outcome 2 is both even and prime.
| Description | Notation | Reading | Elements |
|---|---|---|---|
| Union | \(A \cup C\) | A or C | {2, 3, 4, 5, 6} |
| Intersection | \(A \cap C\) | A and C | {2} |
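These events and set operations can be checked directly in R with `union()`, `intersect()`, and `setdiff()`; a minimal sketch using the die events above:

```r
# Events from a single roll of a 6-sided die
S <- 1:6          # sample space
A <- c(2, 4, 6)   # even
B <- c(1, 3, 5)   # odd
C <- c(2, 3, 5)   # prime

union(A, C)       # A or C: the set {2, 3, 4, 5, 6}
intersect(A, C)   # A and C: the set {2}
intersect(A, B)   # mutually exclusive: empty
setdiff(S, C)     # complement of C: the set {1, 4, 6}
```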
set.seed(1)
OBV <- 1:10
Dist1 <- NULL
Dist9 <- NULL
Dist16 <- NULL
Dist25 <- NULL
Dist36 <- NULL
count = 100
while(count > 0){Dist1 <- c(Dist1,sample(OBV, 1, replace = TRUE)); count <- count - 1}
count = 100
while(count > 0){Dist9 <- c(Dist9,mean(sample(OBV, 9,replace = TRUE) ) ); count <- count - 1}
count = 100
while(count > 0){Dist16 <- c(Dist16,mean(sample(OBV, 16,replace = TRUE) ) ); count <- count - 1}
count = 100
while(count > 0){Dist25 <- c(Dist25,mean(sample(OBV, 25,replace = TRUE) ) ); count <- count - 1}
count = 100
while(count > 0){Dist36 <- c(Dist36,mean(sample(OBV, 36,replace = TRUE) ) ); count <- count - 1}
Dist.df <- data.frame(Size = factor(rep(c(1, 9, 16, 25, 36), each = 100)),
                      Sample_Means = c(Dist1, Dist9, Dist16, Dist25, Dist36))
library(ggplot2)
ggplot(Dist.df, aes(Sample_Means, fill = Size)) + geom_histogram() + facet_grid(. ~ Size)
The likelihood of some event occurring…
The probability of an outcome is defined as the proportion of times the outcome is observed under a large number of repetitions of the random process.
Assume that we are repeating the random process of a coin flip and are recording \(X\), the number of heads in \(n\) coin flips. Then:
\[ P(H) = \lim_{n\to\infty}\frac{X}{n} \]
\[ P(H) =\frac{1}{2} \]
set.seed(42) # for reproducibility
# Number of coin flips
n_flips <- 10000
# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
# Compute the cumulative proportion of heads
cumulative_mean <- cumsum(flips == "H") / (1:n_flips)
# Plot convergence
plot(1:n_flips, cumulative_mean, type = "l", col = "blue", lwd = 2,
xlab = "Number of Flips", ylab = "Proportion of Heads",
main = "Law of Large Numbers: Convergence to 1/2")
abline(h = 1/2, col = "red", lty = 2, lwd = 2) # Reference line at 1/2
The probability of an outcome is a degree of belief or reasonable expectation quantifying one’s state of knowledge based on observed events and prior knowledge.
set.seed(42) # For reproducibility
# Number of coin flips
n_flips <- 10000
# Simulate flipping a fair coin
flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
# Prior: Beta(1, 2) (weak prior belief about p, the probability of heads)
alpha <- 1 # prior successes (heads)
beta <- 2 # prior failures (tails)
# Store posterior mean estimates
posterior_means <- numeric(n_flips)
# Bayesian updating
for (i in 1:n_flips) {
alpha <- alpha + (flips[i] == "H") # Increase count if flip is heads
beta <- beta + (flips[i] != "H") # Increase count otherwise
posterior_means[i] <- alpha / (alpha + beta) # Compute posterior mean
}
# Plot posterior mean convergence
plot(1:n_flips, posterior_means, type = "l", col = "blue", lwd = 2,
xlab = "Number of Flips", ylab = "Posterior Mean of p(heads)",
main = "Bayesian Law of Large Numbers")
abline(h = 1/2, col = "red", lty = 2, lwd = 2) # True probability reference line
\[ 0 \leq P(A) \leq 1 \]
\[ P(S) = 1 \]
\[ P\left(\bigcup_{i=1}^{\infty} A_{i}\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{(for mutually exclusive events } A_i\text{)} \]
\[ P(A) + P(A^C) = 1 \]
If we know that the probability that somebody owns a bike is 0.08, then we know that the probability that somebody does not own a bike is 0.92.
Two processes are independent if knowing about the outcome of one does not help predict the outcome of the other.
Example:
2 flips of a coin
If events \(A\) and \(B\) are from two independent processes:
\[ P(A \cap B) = P(A) \times P(B) \]
The probability of getting 2 heads in 2 flips:
\[ \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \]
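The multiplication rule can be verified by enumerating the sample space of two flips; a quick sketch:

```r
# All four equally likely outcomes of two coin flips
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"))
mean(flips$first == "H" & flips$second == "H") # P(2 heads) = 1/4
```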
If mutually exclusive \[ P(B\cup C) = P(B) + P(C) \]
If not mutually exclusive \[ P(B\cup C) = P(B) + P(C) - P(B \cap C) \]
Example: What is the probability of a heart or a king?
\[ P(H\cup K) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} \]
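The same calculation can be checked in R by enumerating a standard 52-card deck; a sketch:

```r
# Enumerate a standard 52-card deck
deck <- expand.grid(rank = c(2:10, "J", "Q", "K", "A"),
                    suit = c("Hearts", "Diamonds", "Clubs", "Spades"))
heart <- deck$suit == "Hearts"
king  <- deck$rank == "K"

mean(heart | king)                            # P(H or K) = 16/52
mean(heart) + mean(king) - mean(heart & king) # addition rule, same value
```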
The General Social Survey (GSS) is a sociological survey that has been regularly conducted since 1972. It is a comprehensive survey that provides information on experiences of residents of the United States.
Belief in Life After Death (columns) by College Science Course (rows):

| College Science Course | Yes | No | Total |
|---|---|---|---|
| Yes | 375 | 75 | 450 |
| No | 485 | 115 | 600 |
| Total | 860 | 190 | 1050 |
Let \(B\) represent an event that a randomly selected person in this sample believes in life after death.
Let \(C\) represent an event that a randomly selected person in this sample took a college level science course.
A numeric outcome of a random process is called a random variable.
Describes the probability of two or more events occurring together (a joint probability).
Note that events \(B\) and \(C\) are not mutually exclusive. A randomly selected person can believe in life after death and might have taken a college science course. \(B \cap C \neq \emptyset\)
\[ P(B \cap C) = \frac{375}{1050} \]
Note that \(P(B\cap C) = P(C\cap B)\). Order does not matter.
\(P(B)\) represents a marginal probability. So too do \(P(C)\), \(P(B^C)\), and \(P(C^C)\). To calculate these probabilities, we only need the values in the margins of the contingency table:
\[ P(B) = \frac{860}{1050} \]
\[ P(C) = \frac{450}{1050} \]
\(P(B|C)\) represents a conditional probability. So too do \(P(B^C|C)\), \(P(C|B)\), and \(P(C|B^C)\). To calculate these probabilities, we focus on the row or column of the given information, reducing the sample space to that given information.
Probability that a randomly selected person believes in life after death given that they have taken a college science course:
\[ P(B|C) = \frac{375}{450} \]
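These joint, marginal, and conditional probabilities can all be computed from the contingency table in R; a sketch using the GSS counts above:

```r
# Rows: took a college science course (C); columns: believes in life after death (B)
tab <- matrix(c(375,  75,
                485, 115),
              nrow = 2, byrow = TRUE,
              dimnames = list(Course = c("Yes", "No"),
                              Belief = c("Yes", "No")))
n <- sum(tab) # 1050

tab["Yes", "Yes"] / n                 # joint:       P(B and C) = 375/1050
sum(tab[, "Yes"]) / n                 # marginal:    P(B)       = 860/1050
sum(tab["Yes", ]) / n                 # marginal:    P(C)       = 450/1050
tab["Yes", "Yes"] / sum(tab["Yes", ]) # conditional: P(B | C)   = 375/450
```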
Conditioning isolates the effect of a specific variable.
Order Matters!
Likelihood Function
\(P(Y \mid \theta)\) or \(\mathcal{L}(Y \mid \theta)\)
The probability of the observed data, given the hypothesis / parameters.
Does NOT sum to 1
Most scientific questions stem from the likelihood function.
Distinction: downstream interpretations relate to how one constructs (and understands) a likelihood.
Option A: \(P(Data | \mathbf{Hypothesis})\) - Hypothesis fixed, data varies - Frequentist
Option B: \(P(\mathbf{Data} | Hypothesis)\) - Data fixed, hypothesis varies - Bayesian
Mathematical functions that define the likelihood of different outcomes of a random variable
Remember - they must normalize to 1!
Input = Random variable (e.g., flipping heads OR an individual's stature)
Output = Probability of that event occurring given the context.
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
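R's `dnorm()` evaluates exactly this density; a quick sketch comparing a hand-coded version of the formula against it:

```r
# Normal PDF coded directly from the formula above
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}
normal_pdf(1.5, mu = 0, sigma = 1) # matches dnorm(1.5, mean = 0, sd = 1)
```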
Probability Mass Function (PMF)
Countable number of distinct values
\[ P(X = x) = f(x) \]
Let \(X\) be the proportion of correct responses on a test with 100 questions. \(S_X = \{0.00, 0.01, 0.02, \ldots, 0.98, 0.99, 1.00\}\).
Random process: 3 flips of a single coin
Sample Space: \(\{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}\)
Let X be a discrete random variable that represents the total number of heads in 3 coin flips.
# 1. Define the possible outcomes (0, 1, 2, or 3 heads)
x <- c(0, 1, 2, 3)
# 2. Define the probabilities for each outcome
# P(X=0) = 1/8, P(X=1) = 3/8, P(X=2) = 3/8, P(X=3) = 1/8
probs <- c(1/8, 3/8, 3/8, 1/8)
# 3. Create the plot
plot(x, probs,
type = "h", # "h" for high-density vertical lines
lwd = 5, # Line width
col = "steelblue", # Color
main = "PMF of Total Heads in 3 Coin Flips",
xlab = "Number of Heads (x)",
ylab = "P(X = x)",
ylim = c(0, 0.5), # Set y-axis limits
xaxt = "n") # Turn off default x-axis to customize
# 4. Add custom x-axis and points on top of the lines
axis(1, at = 0:3)
points(x, probs, pch = 16, cex = 2, col = "steelblue")
Probability Density Function
Describes the relative likelihood of a continuous random variable falling within some region of the sample space.
The sample space is uncountably infinite.
\[ P(a \leq X \leq b) = \int_a^b f(x)dx \]
\[ \int_{x \in S_X} f(x)\,dx = 1 \]
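This integral can be checked numerically in R with `integrate()`; a sketch using the standard normal density:

```r
# P(-1 <= X <= 1) for X ~ Normal(0, 1), two ways
area <- integrate(dnorm, lower = -1, upper = 1)$value # integrate the PDF
cdf  <- pnorm(1) - pnorm(-1)                          # same area via the CDF
c(area, cdf) # both approximately 0.683
```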
All PMFs / PDFs have a CDF.
Mass functions return a probability, and density functions return a density, each evaluated at a single value of one random variable.
?distributions
There are 21 probability distributions in base R.
Each function is governed by parameters
All have the following construction:
d-xxx \(\rightarrow\) PDF/PMF
p-xxx \(\rightarrow\) CDF
q-xxx \(\rightarrow\) quantile function / inverse CDF
r-xxx \(\rightarrow\) random number generator
What if you want the probability of getting a 60 or lower?
What about greater than 60? [Excluding 60]
Greater than or equal to 60
What if I want the probability between 2 areas? Say 60 and 40.
[1] 0.5379066
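The questions above can be answered with `pbinom()`; a sketch assuming, as in the plot that follows, \(X \sim \text{Binomial}(n = 100, p = 0.6)\):

```r
n <- 100; p <- 0.6
pbinom(60, size = n, prob = p)     # P(X <= 60)
1 - pbinom(60, size = n, prob = p) # P(X > 60), excluding 60
1 - pbinom(59, size = n, prob = p) # P(X >= 60)
pbinom(60, size = n, prob = p) -
  pbinom(39, size = n, prob = p)   # P(40 <= X <= 60)
```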
What if I want the 95th quantile range?
library(ggplot2)
library(dplyr) # needed for case_when()
# Parameters
n <- 100
p <- 0.6
# Quantiles
q_low <- qbinom(0.025, size = n, prob = p)
q_high <- qbinom(0.975, size = n, prob = p)
# PMF data frame
x_vals <- 0:n
df <- data.frame(
x = x_vals,
prob = dbinom(x_vals, size = n, prob = p),
region = case_when(
x_vals < q_low ~ "Lower Tail (2.5%)",
x_vals > q_high ~ "Upper Tail (2.5%)",
TRUE ~ "Middle 95%"
)
)
# Plot
ggplot(df, aes(x = x, y = prob, fill = region, color = region)) +
geom_segment(aes(xend = x, yend = 0), linewidth = 0.8) +
geom_point(size = 2) +
geom_vline(xintercept = c(q_low, q_high), linetype = "dashed", color = "grey30") +
annotate("label", x = q_low, y = max(dbinom(x_vals, n, p)) * 0.9,
label = paste0("q = ", q_low, "\n(p = 0.025)"), size = 3.5, color = "#3b82f6") +
annotate("label", x = q_high, y = max(dbinom(x_vals, n, p)) * 0.9,
label = paste0("q = ", q_high, "\n(p = 0.975)"), size = 3.5, color = "#ef4444") +
scale_color_manual(values = c(
"Lower Tail (2.5%)" = "#3b82f6",
"Middle 95%" = "#22c55e",
"Upper Tail (2.5%)" = "#ef4444"
)) +
scale_fill_manual(values = c(
"Lower Tail (2.5%)" = "#3b82f6",
"Middle 95%" = "#22c55e",
"Upper Tail (2.5%)" = "#ef4444"
)) +
labs(
title = "Binomial PMF with 95% Central Region",
subtitle = paste0("X ~ Binomial(n = 100, p = 0.6) | 95% interval: [", q_low, ", ", q_high, "]"),
x = "x",
y = "P(X = x)",
fill = NULL,
color = NULL
) +
xlim(30, 90) + # zoom in to relevant range
theme_minimal(base_size = 14) +
theme(legend.position = "top")
Simulate the number of defective items in 100 batches of 50 items each, where the probability of a single item being defective is 0.02:
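A minimal sketch of that simulation with `rbinom()` (the seed is an assumption, for reproducibility):

```r
set.seed(123) # assumed seed
# 100 batches of 50 items; each item defective with probability 0.02
defects <- rbinom(n = 100, size = 50, prob = 0.02)
table(defects) # distribution of defective counts across batches
```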
library(ggplot2)
library(dplyr)
# ============================================================
# SETUP: Human Height ~ Normal(mean = 170cm, sd = 10cm)
# ============================================================
mu <- 170 # mean height in cm
sigma <- 10 # standard deviation
# ============================================================
# 1. rnorm() — SIMULATE data (like sampling real people)
# ============================================================
# "If I randomly sampled 1000 people, what heights would I see?"
set.seed(42)
sample_heights <- rnorm(n = 1000, mean = mu, sd = sigma)
df_sample <- data.frame(height = sample_heights)
p1 <- ggplot(df_sample, aes(x = height)) +
geom_histogram(aes(y = after_stat(density)), bins = 40,
fill = "#93c5fd", color = "white") +
geom_density(color = "#1d4ed8", linewidth = 1) +
stat_function(fun = dnorm, args = list(mean = mu, sd = sigma),
color = "#ef4444", linewidth = 1, linetype = "dashed") +
labs(title = "rnorm() — Simulated Sample (n = 1,000)",
subtitle = "Blue = observed density | Red dashed = true Normal(170, 10)",
x = "Height (cm)", y = "Density") +
theme_minimal(base_size = 13)
# ============================================================
# 2. dnorm() — DENSITY (height of the curve at a point)
# ============================================================
# "How relatively likely is it to be exactly 170cm vs 190cm?"
# Note: this is NOT a probability — it's a density value.
# For continuous distributions, P(X = x) = 0 exactly.
x_vals <- seq(130, 210, length.out = 500)
df_density <- data.frame(
x = x_vals,
density = dnorm(x_vals, mean = mu, sd = sigma)
)
# Points to highlight
highlight <- data.frame(
x = c(170, 180, 190),
y = dnorm(c(170, 180, 190), mean = mu, sd = sigma),
label = c("dnorm(170) = 0.0399",
"dnorm(180) = 0.0242",
"dnorm(190) = 0.0054")
)
p2 <- ggplot(df_density, aes(x = x, y = density)) +
geom_line(color = "#1d4ed8", linewidth = 1.2) +
geom_point(data = highlight, aes(x = x, y = y), color = "#ef4444", size = 3) +
geom_label(data = highlight, aes(x = x, y = y, label = label),
nudge_y = 0.003, size = 3.2) +
labs(title = "dnorm() — Probability Density Function (PDF)",
subtitle = "Height of the curve — higher = more 'likely' region, but not a probability",
x = "Height (cm)", y = "Density") +
theme_minimal(base_size = 13)
# ============================================================
# 3. pnorm() — CDF & P-VALUES
# ============================================================
# "What is the probability of being shorter than X cm?"
# This is where p-values live — they are areas under the curve.
# --- One-tailed: P(Height < 185) ---
cutoff <- 185
p_lower <- pnorm(cutoff, mean = mu, sd = sigma, lower.tail = TRUE)
# "What % of people are shorter than 185cm?"
df_shade_lower <- df_density %>% filter(x <= cutoff)
p3 <- ggplot(df_density, aes(x = x, y = density)) +
geom_area(data = df_shade_lower, aes(x = x, y = density),
fill = "#3b82f6", alpha = 0.4) +
geom_line(color = "#1d4ed8", linewidth = 1.2) +
geom_vline(xintercept = cutoff, linetype = "dashed", color = "#ef4444") +
annotate("label", x = 158, y = 0.025,
label = paste0("P(X < 185) = ", round(p_lower, 4),
"\npnorm(185, 170, 10, lower.tail=T)"),
size = 3.5, color = "#1d4ed8") +
labs(title = "pnorm() — CDF: P(Height < 185cm)",
subtitle = "Shaded area = probability = 0.9332 → ~93% of people are shorter than 185cm",
x = "Height (cm)", y = "Density") +
theme_minimal(base_size = 13)
# ============================================================
# 4. qnorm() — QUANTILES (inverse of pnorm)
# ============================================================
# "What height separates the top 5% from everyone else?"
# This is the CRITICAL VALUE concept in hypothesis testing.
q_95 <- qnorm(0.95, mean = mu, sd = sigma) # one-tailed 5%
q_025 <- qnorm(0.025, mean = mu, sd = sigma) # two-tailed 2.5% lower
q_975 <- qnorm(0.975, mean = mu, sd = sigma) # two-tailed 2.5% upper
df_shade_95 <- df_density %>% filter(x >= q_95)
df_shade_low <- df_density %>% filter(x <= q_025)
df_shade_high <- df_density %>% filter(x >= q_975)
p5 <- ggplot(df_density, aes(x = x, y = density)) +
geom_area(data = df_shade_low, aes(x = x, y = density), fill = "#f97316", alpha = 0.5) +
geom_area(data = df_shade_high, aes(x = x, y = density), fill = "#f97316", alpha = 0.5) +
geom_line(color = "#1d4ed8", linewidth = 1.2) +
geom_vline(xintercept = c(q_025, q_975), linetype = "dashed", color = "#f97316") +
annotate("label", x = q_025, y = 0.015,
label = paste0("qnorm(0.025)\n= ", round(q_025, 1), "cm"),
size = 3.2, color = "#7c2d12") +
annotate("label", x = q_975, y = 0.015,
label = paste0("qnorm(0.975)\n= ", round(q_975, 1), "cm"),
size = 3.2, color = "#7c2d12") +
annotate("text", x = 170, y = 0.022,
label = "Middle 95%\n(±1.96 SDs)", size = 3.5, color = "#1d4ed8") +
labs(title = "qnorm() — Quantiles / Critical Values",
subtitle = "qnorm(0.025) & qnorm(0.975) give the 95% interval — the origin of the ±1.96 rule!",
x = "Height (cm)", y = "Density") +
theme_minimal(base_size = 13)